12 research outputs found

    BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

    Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many matrix-multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board. (Comment: To appear at FPL'18.)
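
    The bit-serial decomposition that BISMO builds on can be sketched in a few lines: an integer matrix product is a weighted sum of binary (1-bit) matrix products, one per pair of bit-planes, so total work scales with the product of the operand bit-widths. The sketch below (illustrative only, not BISMO's hardware datapath) shows the arithmetic identity for unsigned operands.

```python
import numpy as np

def bit_serial_matmul(A, B, bits_a, bits_b):
    """Compute A @ B for unsigned integer matrices as a weighted sum of
    binary matrix products, one per (i, j) bit-plane pair. Work scales
    with bits_a * bits_b, so lower precision runs proportionally faster."""
    m, _ = A.shape
    _, n = B.shape
    acc = np.zeros((m, n), dtype=np.int64)
    for i in range(bits_a):
        Ai = (A >> i) & 1            # i-th bit-plane of A (a binary matrix)
        for j in range(bits_b):
            Bj = (B >> j) & 1        # j-th bit-plane of B
            # binary matmul, weighted by 2^(i+j) per the place values
            acc += (Ai @ Bj).astype(np.int64) << (i + j)
    return acc
```

    The identity used is A @ B = sum over i, j of 2^(i+j) * (A_i @ B_j), where A_i and B_j are the bit-planes; each binary product maps to AND-popcount operations, which FPGAs execute very efficiently.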

    Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

    Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many matrix-multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes 6-LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC. (Comment: Invited paper at ACM TRETS as an extension of the FPL'18 paper arXiv:1806.0886.)

    RePAiR: A Strategy for Reducing Peak Temperature while Maximising Accuracy of Approximate Real-Time Computing: Work-in-Progress

    Improving accuracy in approximate real-time computing without violating the thermal-energy constraints of the underlying hardware is a challenging problem. The execution of an approximate real-time task can be split into two parts: (i) execution of the mandatory part of the task to obtain a result of acceptable quality, followed by (ii) partial or complete execution of the optional part, which refines the initially obtained result to increase accuracy without violating the temporal deadline. This paper introduces RePAiR, a novel task-allocation strategy for approximate real-time applications, combined with fine-grained DVFS, online task migration across cores, and power-gating of the last-level cache, to reduce chip temperature while respecting both deadline and thermal constraints. Furthermore, the gained thermal benefits can be traded against system-level accuracy by extending the execution time of the optional part.
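
    The mandatory/optional split described above follows a standard anytime-computation pattern, which can be sketched as below. The function names are illustrative, not from the paper: the mandatory part produces an acceptable result, then optional refinement steps consume whatever slack remains before the deadline.

```python
import time

def run_approximate_task(mandatory, refine_step, deadline_s):
    """Run the mandatory part, then spend any remaining time before the
    deadline on optional refinement steps that improve accuracy."""
    start = time.monotonic()
    result = mandatory()                    # acceptable-quality result
    while time.monotonic() - start < deadline_s:
        improved = refine_step(result)      # one optional refinement step
        if improved is None:                # nothing left to refine
            break
        result = improved
    return result
```

    Under RePAiR, a scheduler would additionally decide how much of this optional phase each task may run, trading the accuracy gain against the chip-temperature and energy budget.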

    Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks

    MicroScope and other similar microarchitectural replay attacks take advantage of the characteristics of speculative execution to trap the execution of the victim application in a loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack software that is shielded against replay, even under conditions where a side-channel attack would not be possible (e.g., in secure enclaves). At the same time, unlike speculative side-channel attacks, microarchitectural replay attacks can be used to amplify the correct path of execution, rendering many existing speculative side-channel defenses ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defense against them. We make the observation that such attacks rely on repeated squashes of so-called "replay handles" and that the instructions causing the side-channel must reside in the same reorder buffer window as the handles. We propose Delay-on-Squash, a hardware-only technique for tracking squashed instructions and preventing them from being replayed by speculative replay handles. Our evaluation shows that it is possible to achieve full security against microarchitectural replay attacks with very modest hardware requirements while still maintaining 97% of the insecure baseline performance.
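
    The core bookkeeping the abstract describes can be modelled very simply. The class below is a behavioural toy, not the paper's hardware design: it remembers which instructions were flushed by a speculative squash and refuses to speculatively re-issue them until the squashing replay handle is known to be resolved, which removes the attacker's ability to replay them.

```python
class DelayOnSquash:
    """Toy model of squash tracking: squashed instructions are delayed
    (must wait to execute non-speculatively) until the replay handle
    that squashed them has left the reorder-buffer window."""

    def __init__(self):
        self.delayed = set()    # PCs of squashed, not-yet-safe instructions

    def on_squash(self, squashed_pcs):
        # record every instruction flushed by a speculative squash
        self.delayed.update(squashed_pcs)

    def may_issue_speculatively(self, pc):
        # tracked instructions may not be replayed speculatively
        return pc not in self.delayed

    def on_handle_resolved(self, pcs_now_safe):
        # once the replay handle resolves, re-execution is harmless
        self.delayed.difference_update(pcs_now_safe)
```

    The real mechanism must bound this state in hardware (the paper reports very modest requirements); the toy set here only conveys the track-then-delay policy.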

    Twig: Multi-agent task management for colocated latency-critical cloud services

    Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets, resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS) aware task manager for latency-critical services co-located on a server system. Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to a 99% QoS guarantee for latency-critical services. This work was funded by the European Union under grant agreement No 754337 (EuroEXA), the Brazilian federal government under CNPq grant (Process no 430188/2018-8), and the Swedish Research Council under grant 2015-05159. The experiments were conducted on the NTNU EPIC computing infrastructure with support from NTNU's HPC group.
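
    The learning loop behind a Twig-style manager can be illustrated with a much simpler stand-in. The sketch below uses a tabular, epsilon-greedy agent rather than Twig's deep reinforcement learning, and the state/action encoding is invented for illustration: the state would come from hardware performance counters, the action is a resource allocation, and the reward favours energy savings only while the tail-latency target holds.

```python
import random

def choose_action(q, state, actions, eps=0.1):
    """Epsilon-greedy selection over a tabular Q function: mostly pick
    the best-known allocation, occasionally explore a random one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def update(q, state, action, reward, alpha=0.5):
    """Move the stored value for (state, action) toward the observed
    reward (bandit-style one-step update)."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward - old)
```

    In this framing, a reward could be positive when power drops without a QoS violation and strongly negative on a tail-latency miss, so the agent learns allocations that save energy only when it is safe to do so.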

    ARCTIC: Approximate Real-Time Computing in a Cache-Conscious Multicore Environment

    Improving result accuracy in approximate computing (AC) based time-critical systems, without violating the power constraints of the underlying circuitry, is becoming increasingly challenging with the rapid progress of technology scaling. The execution span of each AC real-time task can be split into two parts: (i) the mandatory part, whose execution offers a result of acceptable quality, followed by (ii) the optional part, which can be executed partially or completely to refine the initially obtained result and increase result accuracy, while respecting the time constraint. In this article, we introduce ARCTIC, a novel hybrid offline-online scheduling strategy for AC real-time tasks. The goal of the real-time scheduler is to maximise the result accuracy (QoS) of the task-set through opportunistic shedding of the optional part, while respecting system-wide constraints. During execution, ARCTIC retains an exclusive copy of each private cache block only in the local cache of a multi-core system, maintains no copies of these blocks in the other caches, and improves performance (i.e., reduces execution time) by accumulating more live blocks on-chip. Combining offline scheduling with the online cache optimization improves both QoS and energy efficiency. Surpassing prior art, our proposed strategy reduces the task-rejection rate by up to 25% and enhances QoS by 10%, with an average energy-delay-product gain of up to 9.1% on an 8-core system.

    Logic Circuit

    publication date: 2017-12-12; filing date: 2016-05-0